ABSTRACT

Recent research (1978-1982) on student evaluations of
teaching is reviewed, including: influence of background variables
pertaining to the student, the teacher, and the learning environment;
the dimensions of the teaching being evaluated; the validity of
students' evaluations; the "Doctor Fox" effect and its implications
for validity; the reliability, stability, generalizability, and
usefulness of students' evaluations; and the construction and
selection of evaluation questionnaires. Dimensions of teaching that
students evaluate may include: skill, rapport, assignments, breadth
of coverage, tests and grading, group interaction, enthusiasm, and
organization. The extent to which students' evaluations of faculty
correlate with variables thought to reflect effective teaching may be
addressed by considering the following criteria: student achievement,
instructor self-evaluations, and improved student attitudes toward
the subject. A number of studies have examined the "Doctor Fox"
effect: the possibility that student assessments of teacher
effectiveness are more a function of an instructor's wit and
personality than of the educational content of the lecture. It is
concluded that the research indicates that (1) evaluations are not
significantly influenced by background variables, and are valid,
reliable, stable, generalizable, and useful, and (2) properly
constructed evaluation questionnaires assess multiple dimensions of
the instructional process. (SW)

rating studies, and what the most recent research
indicates. These are questions that involve all of us in
higher education and hence merit an updated analysis.

There is an extensive and still expanding literature
base that examines students' evaluations.¹ With few
exceptions, current research is defined as that reported in
the five-year period of 1978 to the present. While some
excellent research is thus not specifically cited, it is
referred to in the studies discussed below. Serious
researchers are urged to pursue those primary sources on
their own.

There are several major areas of consideration in any
discussion of student ratings of courses and instructors:
influence of background variables; the dimensions of the
teaching being evaluated; the validity of students' evalu- 

¹We recommend consulting the excellent general review articles
(Aubrecht, 1981; Marsh, in press; and McKeachie, 1979) or books
(Centra, 1981; and Millman, 1981) recently published on the
subject. These sources have updated and expanded on the much cited
and influential research summaries by Costin, Greenough, and
Menges (1971) and Kulik and McKeachie (1975). They provide a
collection of research findings and outline individual studies whose
findings contributed to the formation of research thought
predominant in that area.

Should an area of interest be so specialized that it is not covered in
one of the review sources listed above, or should the researcher wish a
more comprehensive list of studies about a particular topic, the
Educational Resources Information Center (ERIC) collection can be
helpful. A search of the ERIC holdings will identify relevant articles
from most journals plus important conference papers and
institutional studies not otherwise published.

Jesse U. Overall is manager of evaluation and personnel research,
Office of Institutional Studies, at the University of Southern
California.

Herbert W. Marsh is senior lecturer, Department of Education,
University of Sydney, Sydney, Australia.

Background Variables

To what extent have selected variables in the teaching
and learning environment been found to be associated
with student ratings? A great deal of research has
focused on the extent to which background variables such
as class size, expected grade, reasons for taking a course,
the instructor's research productivity, and both students'
and instructors' personalities are related to students'
evaluations of their courses and instructors. Most of this
research has dealt with single background variables or
combinations of one or two variables. These approaches
produce interesting but piecemeal conclusions, some of
which are discussed further.

Background variables under consideration include
the following: administrative, course, instructor, and
student. In addition, three multivariate studies that
investigated the relationship between many of these
variables and instructional effectiveness in a single setting
are included.

Administrative Variables. Evaluations appear to be
somewhat higher if the student evaluator is identified
or if the instructor is present when the evaluations are
completed (Feldman, 1979). Also, if students believe
that their evaluations will influence decisions on
promotions, they tend to rate their instructors higher than if
they believe their ratings will be used solely for feedback
or instructional improvement purposes (McKeachie,
1979).

Course Variables. Classes with very small numbers of
students (1 to approximately 30) or very large classes
(approximately 100 or more) tend to receive higher
evaluations than those with enrollments between these
figures (Aleamoni, 1981; Aubrecht, 1979; Centra, 1981;
and Feldman, 1978). Courses in the humanities receive
slightly higher ratings of overall effectiveness when
compared with those in the social sciences or natural
sciences. Other average differences are found by
discipline, depending on the particular evaluation
dimension under analysis (Centra, 1981; Feldman, 1978).

Instructor Variables. No specific personality
characteristics in an instructor are consistently or significantly
related to receipt of a high evaluation from students
(McKeachie, 1979). Furthermore, no definite
relationship has been found between an instructor's rank
and his or her evaluations, with the exception of
teaching assistants (TAs). Students tend to rate TAs
lower than faculty (Centra, 1981; Marsh, 1980).

Students' knowledge of an instructor's reputation and
research productivity apparently has some relationship
to their final evaluations, but not to their achievement
(Perry, Abrami, Leventhal, and Check, 1979). The
majority of studies have found small, insignificant, or no
relationships between an instructor's research productivity
(numbers of books and articles published) and the
magnitude (a large number on the positive or negative side)
of student ratings (Aleamoni, 1981; Centra, in press).

Researchers have also considered an instructor's
teaching experience and teaching load. According to
Centra's 1981 summary of research, teacher evaluations
tend to improve in the first few years but tend to decline
after about 12 years. Centra found no evidence to
indicate that evaluations were lower for faculty with larger
teaching loads; in fact, he found the opposite result in
some circumstances.

Student Variables. Age appears to have little
relationship to the magnitude of students' evaluations (Centra,
1981; McKeachie, 1979). There is some evidence that
when the gender of both student and instructor is the
same, higher evaluations may result on some teaching
dimensions (Aleamoni, 1981; Centra, 1981;
McKeachie, 1979). Also, neither personality (Abrami,
Perry and Leventhal, 1982) nor student/instructor
attitude similarity (see Mizener and Abrami, 1981)
appears to have any systematic relationship with student
ratings. Finally, Aleamoni (1981), Centra (1981) and
Feldman (1978) reviewed research and found that while
students taking elective courses tend to provide higher
ratings, there is no statistically significant relationship
between a student's major and his or her rating of a
course or instructor.

Multivariate Studies. Recently, three studies were
reported in which many of these variables were
simultaneously related to instructional effectiveness. Stumpf,
Freedman and Aguanno (1979) took the average class
ratings given to 129 instructors in all courses taught over
two semesters, and studied their relationships to several
important background variables. Their results indicate
only a minor relationship between average ratings and
background variables, suggesting that these external
factors do not unduly influence the ratings.

Marsh (1980) examined the relationship between
students' evaluations and a set of 16 background variables.
His multivariate analyses indicate that only 14 percent
or less of the variance in the ratings on nine individual
rating dimensions or two overall summary items can be
explained by the entire set of 16 background variables.
The most influential variables were Prior Subject
Interest [...] and percentage of students taking the course
for General Interest. In a subsequent analysis (Marsh and
Cooper, 1981), Prior Subject Interest was found to account
for the largest proportion of variance in students'
evaluations (5.1%). This variable also accounted for about
one-third of the relationship between ratings and
expected grades.

Dimensions of Teaching 

What aspects of teaching are students actually evalu- 
ating? Common sense and a growing body of empirical 
research indicate that properly constructed evaluation 
questionnaires can provide data on several dimensions 
of teaching. However, it is important to establish the ex- 
istence of these dimensions through factor analysis 
before assuming a priori that a group of rating items 
necessarily reflects them.²

Frey (1978) analyzed student ratings of instruction
and presented solid evidence for two factors or
dimensions: Skill and Rapport. Marsh (in press), based on
his work with the Student Evaluation of Educational
Quality (SEEQ) questionnaire both in this country and
in Australia, has identified the following dimensions:
Assignments/Readings, Breadth of Coverage,
Examinations/Grading, Group Interaction, Individual
Rapport, Instructor Enthusiasm, Learning/Value,
Organization/Clarity, and Workload/Difficulty. Most of the
research in this area, as exemplified by the findings of
Frey and Marsh, indicates that properly constructed
questionnaires will reflect this multi-dimensionality.

Validity 

To what extent do students' evaluations of faculty
correlate with variables thought to reflect effective
teaching? Because there is no single criterion of effective
teaching, several criteria have been selected by
consensus as being indicative of instructional effectiveness.
Using this construct validation approach, the following
criteria appear most relevant.

Achievement. Recent research reported by Centra
(1977) and Marsh and Overall (1980) has found
positive, significant correlations between average class
achievement and magnitude of end-of-term
evaluations. Cohen's 1981 review of 41 separate multi-section
validity studies reported that student achievement
correlated 0.43 with the overall instructor rating and 0.47
with the overall course rating. Mean correlations of 0.22
or higher were noted for all typical rating dimensions
except Course Difficulty.
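
As a rough aid to reading these coefficients, the short sketch below
squares each reported correlation to obtain the proportion of variance
that achievement and ratings share. The function name and the
dictionary labels are illustrative assumptions, not part of Cohen's
review; only the two correlation values come from the text above.

```python
def shared_variance(r: float) -> float:
    """Return r squared: the proportion of variance two measures share."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Correlations with student achievement as reported in the text above.
reported = {
    "overall instructor rating": 0.43,
    "overall course rating": 0.47,
}

for label, r in reported.items():
    print(f"{label}: r = {r:.2f}, shared variance = {shared_variance(r):.1%}")
```

Under this reading, a correlation of 0.43 corresponds to a little under
one-fifth of the variance being shared, which is consistent with
describing the relationship as positive and significant but far from
complete.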

Instructor Self-Evaluations. Braskamp, Caulley and
Costin (1979), Doyle and Crichton (1978), and Marsh,
Overall and Kesler (1979) studied the relationship be-
tween students' evaluations of individual faculty and the 
self-ratings of these faculty. In every case, significant 
positive relationships were found, indicating that in- 
structors and students tended to agree on the effec- 
tiveness of instruction in a variety of situations. 

Improved Attitudes Toward the Subject. Focusing on an
area of increasing interest to researchers, Marsh and
Overall (1980) investigated the relationship between
students' evaluations and their reported changes in
subject matter interest. Positive, statistically significant
correlations were found between end-of-term attitudes
toward the subject and evaluations received by instructors
on most rating dimensions.

²Anyone wishing to develop a questionnaire independently should
first read Abrami, Leventhal and Dickens' excellent 1981 discussion
of the multi-dimensionality issue. Should one wish to review
available instruments that have undergone extensive testing,
Braskamp's appendix in Centra's 1981 book provides summary
information about the most widely used questionnaires.

The "Doctor Fox" Effect 

To what extent can students be lured into providing
higher ratings, regardless of lecture content? What are
the implications of this research with respect to the
validity of students' evaluations? In 1973, Naftulin, Ware
and Donnelly reported results from a study suggesting
that student assessments of teaching effectiveness were
more a function of an instructor's wit and personality
than of the educational content of his or her lecture. This
investigation and subsequent related research have been
called the "Doctor Fox" studies. The name is based on
the pseudonym used by the instructor, a professional
actor, who presented the initial lecture.

Further research on this effect by Naftulin, Ware and
Donnelly concentrated on the use of video tapes
covering six lectures presented by one professional actor.
Instructor expressiveness and lecture content were
systematically manipulated in each of these lectures in an
attempt to replicate the Doctor Fox effect. A review of
these studies by Ware and Williams (1979, 1980) led
them to conclude that differences in content consistently
explain much less variance in students' overall
evaluations than do differences in expressiveness.

A 1982 reanalysis of data from the Ware and
Williams studies by Marsh and Ware focused on five
specific teaching dimensions: Clarity/Organization,
Concern, Enthusiasm, Knowledge, and ability to stimulate
Learning. Among other findings, they report that in
manipulating instructor expressiveness, only ratings on
the Enthusiasm dimension were affected; in
manipulating the content coverage, only ratings on the Knowledge
dimension were affected. This research indicates the
importance of using individual evaluation dimensions in
addition to overall summary items. It also demonstrates
that even if a variable affects instructor ratings, the
ratings are not necessarily invalid.

Abrami, Leventhal and Perry (1982) also reviewed
the Doctor Fox research. They found that
expressiveness has a much larger effect on ratings than on
achievement, that content has a much larger effect on
achievement than on ratings, and that for either ratings
or achievement, the effect of content does not vary to an
important extent over levels of expressiveness. They
conclude that their findings on the validity of students'
evaluations (and similar ones in previous research)
can be used as evidence only by first documenting the
importance of expressiveness to instruction. They
reasoned that because such documentation is absent from
earlier research, the Doctor Fox studies, viewed as a
measure of instructional processes at work in the field,
tell little about the validity or invalidity of students'
evaluations.



Reliability/Stability

Do students within the same class agree on the
effectiveness of their instructors? Are student ratings stable
over time? Research on the reliability of students'
evaluations has focused either on the extent to which there is
agreement among different students evaluating the
same course and instructor (interrater agreement), or
agreement among different items purported to measure
the same trait (internal consistency). Feldman's 1977
review of this research found that single-rater reliability,
when based on a class size of about 20, has reliability
coefficients generally greater than 0.80. He thus found the
reliability of students' evaluations to compare favorably
with the best objective tests if the evaluations are based
on a sufficient number of student responses.

In discussions of the stability of students' evaluations
it is not uncommon to hear the proposal that
retrospective (follow-up) ratings from graduates should be used
as the basis for instructor evaluation rather than
end-of-term assessments provided by continuing students. The
rationale for this view is based on the suggestion that
follow-up ratings will in some ways differ significantly
from end-of-term ratings because they allow former
students to develop additional perspectives about, and to
obtain emotional separation from, the person and
situation being assessed. These follow-up ratings would
thus be based on more informed, reflective, and mature
judgments.

Overall and Marsh (1980) compare ratings that
individual students provided at the end of their courses
with subsequent ratings collected from the same
students a minimum of one year after graduation. They
note insignificant absolute differences between the two
sets of ratings, and find a median correlation of 0.83 for
all rating dimensions. They conclude that there were no
practical differences in the information provided by
either set of evaluations, and that these evaluations
appear to be quite stable over time.

Generalizability

Are some courses rated less favorably than others on a
systematic basis? What is the relative importance of the
particular instructor and the specific course in
determining student ratings? Research conducted by Gillmore,
Kane and Naccarato (1978) led them to conclude that 
with the instructor as the object of measurement, mod- 
erately dependable results can be obtained by generaliz- 
ing over rating items and students. Furthermore, they 
conclude that the specific course is not a major factor in 
determining course evaluations. 

Marsh and Overall (1981) studied the relative
contribution of the instructor, course level
(graduate/undergraduate) and course type
(nonquantitative/quantitative) to variance in end-of-term and
retrospective student evaluations. They found that
individual instructor performance accounted for 5 to 10
times as much variance in both sets of ratings as did
course level or type, suggesting that the particular
course subject matter has little effect on student ratings
and that the same instructor would probably receive
similar ratings in a different course.

In a subsequent study, Marsh (1981) used path
analysis to demonstrate that the instructor is the most
important determinant of student ratings. Relative to the
instructor, the particular course being taught plays a
small role.

The results of these studies show consistently that the
instructor is more important than the course being
evaluated. Thus, ratings obtained by an instructor in one
type of course do not necessarily put him or her at a
disadvantage when one instructor's evaluations are
compared with evaluations of instructors teaching other
types of courses.

Usefulness

To what extent is feedback from students' evaluations
of faculty associated with instructional improvement?
Many researchers are concerned with the question of
whether instructors who receive summaries of
evaluations from their students tend to become more effective
teachers and whether students' evaluations are an
effective source of feedback in the instructional improvement
process. While early research examined by McKeachie
(1979) and Rotem and Glasman (1979) found results
inconclusive, more recent studies indicate a positive
association. Previous research utilized printed feedback
provided to each instructor at the end of the term. More
recent studies have focused on the addition of individual
peer consultation to assist instructors with interpreting
and utilizing results from the printed feedback.

Overall and Marsh (1979) found that instructors who
received written feedback from their students at midterm
also received more favorable ratings at the end of the
term. Their students earned higher final examination
scores and reported more favorable affective outcomes.
The key ingredient here, they conclude, is the use of an
external consultant to interpret the written summaries
along the lines suggested by McKeachie and Lin (1975).
In a subsequent study, McKeachie et al. (1980)
obtained similar findings. They conclude that presentation
of encouragement and suggestions to an instructor, in
addition to printed feedback, results in a more effective
approach to instructional improvement.

Cohen's 1980 meta-analysis of feedback studies
contains an excellent summary of research in this area.
Reviewing 22 college-level studies concerned with this
topic, Cohen found a general accentuation of student
rating feedback effects when printed summaries were
augmented by consultation. These effects were more
pronounced for some rating dimensions than for others.
Cohen also noted that the positive impact of feedback is
not dependent on whether the feedback is used in a
midterm/end-of-term or a term-by-term sequence.

Conclusions 

A review of recent research concerned with student
ratings of teaching indicates that such evaluations are
not significantly influenced by background variables,
and are valid, reliable, stable, generalizable, and useful.
This research further indicates that properly
constructed evaluation questionnaires assess multiple
dimensions of the instructional process. Thus, when
obtained from properly designed questionnaires, data
from students' evaluations is particularly appropriate
and useful in the instructional evaluation process.